ABSTRACT We have developed three computer programs
for comparisons of protein and DNA sequences. They
can be used to search sequence data bases, evaluate similarity 
scores, and identify periodic structures based on local se- 
quence similarity. The FASTA program is a more sensitive 
derivative of the FASTP program, which can be used to search
protein or DNA sequence data bases and can compare a 
protein sequence to a DNA sequence data base by translating 
the DNA data base as it is searched. FASTA includes an 
additional step in the calculation of the initial pairwise simi- 
larity score that allows multiple regions of similarity to be 
joined to increase the score of related sequences. The RDF2 
program can be used to evaluate the significance of similarity
scores using a shuffling method that preserves local sequence 
composition. The LFASTA program can display all the re- 
gions of local similarity between two sequences with scores 
greater than a threshold, using the same scoring parameters 
and a similar alignment algorithm; these local similarities can 
be displayed as a "graphic matrix" plot or as individual 
alignments. In addition, these programs have been generalized
to allow comparison of DNA or protein sequences based on a 
variety of alternative scoring matrices. 

We have been developing tools for the analysis of protein
and DNA sequence similarity that achieve a balance of 
sensitivity and selectivity on the one hand and speed and 
memory requirements on the other. Three years ago, we
described the FASTP program for searching amino acid 
sequence data bases (1), which uses a rapid technique for 
finding identities shared between two sequences and exploits 
the biological constraints on molecular evolution. FASTP 
has decreased the time required to search the National 
Biomedical Research Foundation (NBRF) protein sequence 
data base by more than two orders of magnitude and has 
been used by many investigators to find biologically signifi- 
cant similarities to newly sequenced proteins. There is a 
trade-off between sensitivity and selectivity in biological 
sequence comparison: methods that can detect more distantly
related sequences (increased sensitivity) frequently
increase the similarity scores of unrelated sequences (de- 
creased selectivity). In this paper we describe a new version 
of FASTP, FASTA, which uses an improved algorithm that
increases sensitivity with a small loss of selectivity and a 
negligible decrease in speed. We have also developed a 
related program, LFASTA, for local similarity analyses of
DNA or amino acid sequences. These programs run on 
commonly available microcomputers as well as on larger 
machines. 

METHODS 

The search algorithm we have developed proceeds through
four steps in determining a score for pairwise similarity.
FASTP and FASTA achieve much of their speed and selectivity
in the first step, by using a lookup table to locate all
identities or groups of identities between two DNA or amino
acid sequences (2).
The ktup parameter determines how many consecutive identities
are required in a match. For example, if ktup = 4 for a
DNA sequence comparison, only those identities that occur 
in a run of four consecutive matches are examined. In the 
first step, the 10 best diagonal regions are found using a 
simple formula based on the number of ktup matches and the 
distance between the matches without considering shorter 
runs of identities, conservative replacements, insertions, or 
deletions (1, 3). 
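
The lookup-table idea in the first step can be illustrated with a minimal Python sketch. It is not the authors' C implementation: the function names are hypothetical, and diagonals are ranked here simply by their number of ktup hits, whereas the formula cited above also takes the distance between matches into account.

```python
from collections import defaultdict

def ktup_diagonal_hits(query, library, ktup):
    """Count exact ktup-length word matches on each diagonal (i - j).

    i is a position in the query and j a position in the library sequence.
    Hypothetical helper: hit counts stand in for the paper's diagonal score.
    """
    # Lookup table: every ktup-length word in the query -> its start positions.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)

    hits = defaultdict(int)
    # Scan the library sequence once; each word is found by a table lookup.
    for j in range(len(library) - ktup + 1):
        for i in table.get(library[j:j + ktup], ()):
            hits[i - j] += 1
    return dict(hits)

def best_diagonals(hits, n=10):
    """Return the n diagonals with the most ktup hits (assumed ranking rule)."""
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

For a DNA comparison one might call `ktup_diagonal_hits(query, library, 4)` and pass the result to `best_diagonals`; a larger ktup shrinks the number of hits to examine at the cost of ignoring shorter runs of identities.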

In the second step of the comparison, we rescore these 10 
regions using a scoring matrix that allows conservative 
replacements and runs of identities shorter than ktup to 
contribute to the similarity score. For protein sequences, 
this score is usually calculated using the PAM250 matrix (4),
although scoring matrices based on the minimum number of 
base changes required for a replacement or on an alternative 
measure of similarity can also be used with FASTA. For 
each of these best diagonal regions, a subregion with maximal
score is identified. We will refer to this region as the
"initial region"; the best initial regions from Fig. 1A are
shown in Fig. 1B.
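
Finding the maximal-scoring subregion of a diagonal can be sketched as a running-sum scan. The sketch below is only an illustration under stated assumptions: `toy_score` is a hypothetical match/mismatch function standing in for a real matrix such as PAM250, and the function name and return format are not taken from the FASTA source.

```python
def toy_score(a, b):
    """Hypothetical scoring function: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1

def best_subregion(query, library, diagonal, score=toy_score):
    """Return (best_score, start_i, end_i) for one diagonal.

    diagonal = i - j, with i indexing the query and j the library sequence;
    end_i is exclusive. An empty region scores 0.
    """
    i_start = max(0, diagonal)
    i_stop = min(len(query), len(library) + diagonal)
    best = (0, i_start, i_start)
    running, run_start = 0, i_start
    for i in range(i_start, i_stop):
        running += score(query[i], library[i - diagonal])
        if running <= 0:
            # A non-positive prefix can never help; restart after it.
            running, run_start = 0, i + 1
        elif running > best[0]:
            best = (running, run_start, i + 1)
    return best
```

Because every aligned pair on the diagonal contributes to the running sum, conservative replacements and identity runs shorter than ktup can raise the score, which is the point of the rescoring step.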

The FASTP program uses the single best scoring initial 
region to characterize pair-wise similarity; the initial scores 
are used to rank the library sequences. FASTA goes one
step further during a library search: in a third step, it checks to see whether
several initial regions may be joined together. Given the 
locations of the initial regions, their respective scores, and a 
"joining" penalty (analogous to a gap penalty), FASTA 
calculates an optimal alignment of initial regions as a com- 
bination of compatible regions with maximal score. FASTA 
uses the resulting score to rank the library sequences. We 
limit the degradation of selectivity by including in the 
optimization step only those initial regions whose scores are 
above a threshold. This process can be seen by comparing 
Fig. 1B with Fig. 1C. Fig. 1B shows the 10 highest scoring
initial regions after rescoring with the PAM250 matrix; the 
best initial region reported by FASTP is marked with an 
asterisk. Fig. 1C shows an optimal subset of initial regions 
that can be joined to form a single alignment. 
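
The joining step can be read as a small chaining problem over the initial regions. The following Python sketch assumes that two regions are compatible when one ends before the other begins in both sequences, and that a single fixed penalty is charged per join; the paper does not spell out either detail here, so both are labeled assumptions, as are the function name and tuple layout.

```python
def join_initial_regions(regions, join_penalty, threshold):
    """Best total score of a compatible chain of initial regions.

    regions: list of (q_start, q_end, l_start, l_end, score), ends exclusive,
    each region non-empty. Only regions scoring above `threshold` take part.
    Assumed compatibility rule: region A may precede region B when A ends
    at or before B's start in both the query and the library sequence.
    """
    kept = sorted(r for r in regions if r[4] > threshold)
    best = []  # best[k] = best chain score for a chain ending with kept[k]
    for k, (qs, _qe, ls, _le, sc) in enumerate(kept):
        prev = 0
        for j in range(k):
            _, qe_j, _, le_j, _ = kept[j]
            if qe_j <= qs and le_j <= ls:
                prev = max(prev, best[j] - join_penalty)
        best.append(sc + prev)
    return max(best, default=0)
```

Raising the threshold removes weak regions from the chain, which is how the loss of selectivity is limited; with a very high threshold the result reduces to the single best initial region, as in FASTP.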

Abbreviation: NBRF, National Biomedical Research Foundation.

Fig. 1. Identification of sequence similarities by FASTA. The
four steps used by the FASTA program to calculate the initial and
optimal similarity scores between two sequences are shown. (A)
Identify regions of identity. (B) Scan the regions using a scoring
matrix and save the best initial regions. Initial regions with scores
less than the joining threshold (27) are dashed. The asterisk denotes
the highest scoring region reported by FASTP. (C) Optimally join
initial regions with scores greater than a threshold. The solid lines
denote regions that are joined to make up the optimized initial score.
(D) Recalculate an optimized alignment centered around the highest
scoring initial region. The dotted lines denote the bounds of the
optimized alignment. The result of this alignment is reported as the
optimized score.

In the fourth step of the comparison, the highest scoring
library sequences are aligned using a modification of the
optimization method described by Needleman and Wunsch
(5) and Smith and Waterman (6). This final comparison
considers all possible alignments of the query and library
sequence that fall within a band centered around the highest
scoring initial region (Fig. 1D). With the FASTP program,
optimization frequently improved the similarity scores of
related sequences by factors of 2 or 3. Because FASTA
calculates an initial similarity score based on an optimization
of initial regions during the library search, the initial score is
much closer to the optimized score for many sequences. In
fact, unlike FASTP, the FASTA method may yield initial
scores that are higher than the corresponding optimized
scores.
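
The band-limited optimization of the fourth step can be sketched as a Smith-Waterman style recurrence that only fills cells near one diagonal. This is a simplified illustration: it uses a single linear gap penalty, although the programs distinguish the first residue in a gap from subsequent ones, and `toy_score`, the function name, and the parameter names are hypothetical.

```python
def toy_score(a, b):
    """Hypothetical scoring function: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1

def banded_local_score(query, library, diagonal, half_width, gap, score=toy_score):
    """Best local alignment score restricted to a band around one diagonal.

    Cell (i, j) aligns query[i-1] with library[j-1] and lies on diagonal
    i - j; only cells with |(i - j) - diagonal| <= half_width are filled.
    Assumption: linear gap penalty `gap` per gapped residue.
    """
    n, m = len(query), len(library)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        j_lo = max(1, i - diagonal - half_width)
        j_hi = min(m, i - diagonal + half_width)
        for j in range(j_lo, j_hi + 1):
            cur[j] = max(
                0,                                                   # start a new alignment
                prev[j - 1] + score(query[i - 1], library[j - 1]),   # extend on the diagonal
                prev[j] - gap,                                       # gap in the library sequence
                cur[j - 1] - gap,                                    # gap in the query
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Because the band is centered on a single initial region, joined regions that lie outside it cannot contribute, which is one way to see why a FASTA initial score can exceed the optimized score.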

Local Similarity Analyses. Molecular biologists are often
interested in the detection of similar subsequences within 
longer sequences. In contrast to FASTP and FASTA, which
report only the one highest scoring alignment between two 
sequences, local sequence comparison tools can identify 
multiple alignments between smaller portions of two se- 
quences. Local similarity searches can clearly show the 
results of gene duplications (see Fig. 2) or repeated structural
features (see Fig. 3) and are frequently displayed using
a "graphic matrix" plot (7), which allows one to detect
regions of local similarity by eye. Optimal algorithms for 
sensitive local sequence comparison (6, 8, 9) can have
tremendous computational requirements in time and memory,
which make them impractical on microcomputers and,
when comparing longer sequences, on larger machines as 
well. 

The program for detecting local similarities, LFASTA,
uses the same first two steps for finding initial regions that
FASTA uses. However, instead of saving 10 initial regions,
LFASTA saves all diagonal regions with similarity scores
greater than a threshold. LFASTA and FASTA also differ in
the construction of optimized alignments. Instead of focusing
on a single region, LFASTA computes a local alignment
for each initial region. Thus LFASTA considers all of the
initial regions shown in Fig. 1B, instead of just the diagonal
shown in Fig. 1D. Furthermore, LFASTA considers not
only the band around each initial region but also potential
sequence alignments for some distance before and after the
initial region. Starting at the end of the initial region, an
optimization (6) proceeds in the reverse direction until all
possible alignment scores have gone to zero. The location of
the maximal local similarity score in the reverse direction is
then used to start a second optimization that proceeds in the
forward direction. An optimal path starting from the forward
maximum is then displayed (5). The local homologies can be
displayed as sequence alignments (see Fig. 2B) or on a
two-dimensional graphic matrix style plot (see Figs. 2A and 3).

Statistical Significance. The rapid sequence comparison 
algorithms we have developed also provide additional tools 
for evaluating the statistical significance of an alignment. 
There are approximately 5000 protein sequences, with 1.1
million amino acid residues, in the NBRF protein sequence 
library, and any computer program that searches the library 
by calculating a similarity score for each sequence in the 
library will find a highest scoring sequence, regardless of 
whether the alignment between the query and library se- 
quence is biologically meaningful or not. Accompanying the 
previous version of FASTP was a program for the evaluation 
of statistical significance, RDF, which compares one se- 
quence with randomly permuted versions of the potentially 
related sequence. 

We have written a new version of RDF (RDF2) that has 
several improvements. (i) RDF2 calculates three scores for
each shuffled sequence: one from the best single initial region
(as found by FASTP), a second from the joined initial regions
(used by FASTA), and a third from the optimized diagonal.
(ii) RDF2 can be used to evaluate amino acid or DNA
sequences and allows the user to specify the scoring matrix to
be employed. Thus sequences found using the PAM250
scoring matrix can be evaluated using the identity or genetic
code matrix. (iii) The user may specify either a global or local
shuffle routine.

Locally biased amino acid or nucleotide composition is 
perhaps the most common reason for high similarity scores 
of dubious biological significance (10). High scoring align- 
ments between query and library sequences may be due to 
patches of hydrophobic or charged amino acid residues or to 
A + T- or G + C-rich regions in DNA. A simple Monte Carlo 
shuffle analysis that constructs random sequences by taking 
each residue in one sequence and placing it randomly along 
the length of the new sequence will break up these patches of 
biased composition. As a result, the scores of the shuffled 
sequences may be much lower than those of the unshuffled
sequence, and the sequences will appear to be related. 
Alternatively, shuffled sequences can be constructed by 
permuting small blocks of 10 or 20 residues so that, while the 
order of the sequence is destroyed, the local composition is 
not. By shuffling the residues within short blocks along the 
sequence, patches of G + C- or A + T-rich regions in DNA, 
for example, are undisturbed. Evaluating significance with a
local shuffle is more stringent than the global approach, and
there may be some circumstances in which both should be 
used in conjunction. Whereas two proteins that share a 
common evolutionary ancestor may have clearly significant 
similarity scores using either shuffling strategy, proteins 
related because of secondary structure or hydropathic pro- 
file may have similarity scores whose significance decreases 
dramatically when the results of global and local shuffling 
are compared. 
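
The difference between the two shuffling strategies is easy to see in code. The sketch below is a plain Python illustration rather than the RDF2 routine itself; the function names and the `window` and `rng` parameters are hypothetical.

```python
import random

def global_shuffle(seq, rng):
    """Place every residue at a random position along the whole sequence."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def local_shuffle(seq, window, rng):
    """Shuffle residues only within consecutive blocks of `window` residues.

    The order inside each block is destroyed, but each block keeps its own
    composition, so locally biased patches stay where they were.
    """
    out = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])
        rng.shuffle(block)
        out.extend(block)
    return "".join(out)
```

With `rng = random.Random(0)`, a sequence made of ten A residues followed by ten G residues is returned unchanged by `local_shuffle(seq, 10, rng)`, while `global_shuffle` mixes the two patches. That is why scores against locally shuffled sequences tend to stay higher, making the local test the more stringent one.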

Implementation. The FASTA/LFASTA package of sequence
analysis tools is written in the C programming language
and has been implemented under the Unix, VAX/VMS,
and IBM PC DOS operating systems. Versions of the
program that run on the IBM PC are limited to query
sequences of 2000 residues; library sequences can be any
length. Copies of the program are available from the authors.

[Table 1. Only one row label survives: Ig heavy-chain V-II region.]
The average FASTP score = 26.1 ± 6.8 (mean ± SD). The
average FASTA score = 26.2 ± 7.2 (mean ± SD). The mean and
SD were computed excluding scores >54. V, variable.

Although FASTA and LFASTA were designed for protein
and DNA sequence comparison, they use a general method
that can be applied to any alphabet with arbitrary match/ 
mismatch scoring values. All the scoring parameters, includ- 
ing match/mismatch values, values for the first residue in a 
gap and subsequent residues in the gap, and other parameters
that control the number of sequences to be saved and
the histogram intervals, can be specified without changing 
the program. 

EXAMPLES 

Comparison of FASTA with FASTP. To demonstrate the
superiority of the FASTA method for computing the initial
score, we compared the protein sequence of a T-cell receptor
α chain (NBRF code RWMSAV) with all sequences in the
NBRF protein data base* and computed initial scores with
both the present and previous methods. The T-cell receptor is
a member of the immunoglobulin superfamily; in Release 12.0
of the data base, this superfamily has 203 members. FASTP
placed 160 immunoglobulin superfamily sequences in the 200
top-scoring sequences; 57 related sequences received initial
scores less than four standard deviations above the mean
score. FASTA placed 180 superfamily members in the 200
top-scoring sequences; only 20 related sequences scored
below four standard deviations above the mean. Table 1
contains specific examples from this data base search. Although
there is often little difference in the two methods, this ex- 
ample shows that in a number of cases the new method ob- 
tains significantly higher scores between related sequences. 

Nucleic Acid Data Base Search. FASTA can also be used to
search DNA sequence data bases, either by comparing a
DNA query sequence to the DNA library or by comparing an
amino acid query sequence to the DNA library by translating
each library DNA sequence in all six possible reading
frames. We compared the 660-nucleotide rat transforming
growth factor type α mRNA (GenBank locus RATTGFA)
with all the mammalian sequences in Release 48 of GenBank.*
We set ktup = 4 (see Methods), and the search was
completed in under 15 min on an IBM PC AT microcomputer.
The 10 top-scoring library sequences are shown in
Table 2. Although it can be seen that the 3 top-scoring
sequences are clearly related to RATTGFA, there are other
high-scoring sequences that are probably not related, and the
mouse epidermal growth factor, found in the translated data
base search (Table 3), is not found among the top-scoring
sequences.

*Protein Identification Resource, Protein Sequence Database
(Natl. Biomed. Res. Found., Washington, DC), Release 12.0.
*EMBL/GenBank Genetic Sequence Database (Intelligenetics,
Mountain View, CA), Tape Release 48.

Table 2. DNA data base search of rat transforming growth factor
(RATTGFA) versus mammalian sequences

GenBank locus   Sequence
HUMTFGAM        Human TGF mRNA
HUMTGFA2        Human TGF gene (exon 2)
HUMTGFA1        Human TGF gene (5' end)
MUSRGEB3        Mouse 18S-5.8S-28S rRNA gene
MUSRGK52        Mouse 18S-5.8S-28S rRNA gene
MUSMHDD         MHC class I H-2D
HUMMFTIK1       Metallothionein (MT) IK gene
MUSRGLP         45S rRNA (5' end)
HUMPS2          pS2 mRNA
MUSCTA11        α-1 type I procollagen

[The Initial and Optimized score columns are not recoverable; a single
value, 86, survives without its row.]
The 10 sequences having the highest initial scores are given. TGF,
transforming growth factor; MHC, major histocompatibility complex.

To further examine the similarity detected between RATTGFA
and MUSRGEB3, a mouse rRNA gene cluster, we
used the RDF2 program for Monte Carlo analysis of statis- 
tical significance (the window for local shuffling was set to 10 
bases). Of the 50 shuffled comparisons (data not shown), 1 
obtained an initial score greater than 140 (the observed initial 
score), and 9 shuffled sequences obtained optimized scores 
greater than 107 (the observed optimized score). Therefore, 
the similarity between RATTGFA and MUSRGEB3 is un- 
likely to be significant. 

Translated Nucleic Acid Data Base Search. When searching 
for sequences that encode proteins, amino acid sequence 
comparisons are substantially more sensitive than DNA se- 
quence comparisons because one can use scoring matrices 
like the PAM250 matrix that discriminate between conservative
and nonconservative substitutions. A variant of FASTA,
TFASTA, can be used to compare a protein sequence to a
DNA sequence library; it translates the DNA sequences into 
each of six possible reading frames "on-the-fly." TFASTA 
translates the DNA sequences from beginning to end; it 
includes both intron and exon sequences in the translated 
protein sequence; termination codons are translated into 
unknown (X) amino acids. Table 3 shows the results of a 
translating search of the mammalian sequences in the Gen- 
Bank DNA data base using the RATTGFA protein sequence 
as the query and ktup = 1. In the translated search, the mouse 
epidermal growth factor now obtains an initial score higher 
than any unrelated sequences; however, HUMTGFA1, which
was found in the DNA data base search but only contains 13
translated codons, is no longer among the top scoring
sequences.
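
The on-the-fly translation described for TFASTA can be sketched as a six-frame translation in which termination codons become X. This is a minimal Python illustration that assumes the standard genetic code and uppercase A/C/G/T input; the function names are hypothetical, and any codon containing another character is also mapped to X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate from the first base to the end; stops and unknowns become X."""
    protein = []
    for k in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[k:k + 3], "X")
        protein.append("X" if aa == "*" else aa)
    return "".join(protein)

def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(_COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]
```

Translating straight through introns and stop codons keeps each frame as one continuous pseudo-protein, so a library entry never has to be split into open reading frames before it is compared with the query.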

Local Similarities. Fig. 2 displays the output of a local
similarity analysis (ktup = 4) of CHPHBA1M, a chimpanzee
α1-globin mRNA, and RABHBAPT, a rabbit α-globin gene,
including the complete coding sequence and a flanking
pseudo-θ1-globin gene. LFASTA can either display a graphic
matrix style plot of the local homologies (Fig. 2A) or the
alignments themselves (Fig. 2B). The right-most three alignments
(Fig. 2A) match the corresponding regions of the
mRNA to exon subsequences from the pseudogene. We note
that the FASTA initial score for the comparison of CHPHBA1M
and RABHBAPT would be based on the three globin
gene exons, while the FASTP initial score would be based on
a single conserved exon.

Table 3. Translated DNA data base search ... mammalian sequences

GenBank locus   Sequence
MUSMMAH3        Mouse MHC class II H2-IA
MUS1GCD17       Mouse Ig germ-line DJC region
HUMKSTR         Human estrogen receptor
RATINSI         Rat insulin I (ins-1) gene
MUSTHYS1        Mouse thymidylate synthase
HUMHNU3         Human purine nucleoside phosphorylase

[Only six rows survive; the score columns are not recoverable.]
The 10 sequences having the highest initial scores are given.

The Smith-Waterman optimization used in the LFASTA
program allows the detection of more subtle features than 
can be detected by the eye using a graphic matrix plot, 
because the path traced is locally optimal, even though it 
may only have a slightly higher density of identities and
conservative replacements. Fig. 3 shows a plot from a local 
similarity self-comparison of the myosin heavy chain from 
the nematode Caenorhabditis elegans (MWKW) using the
PAM250 matrix. The amino-terminal half of the molecule
forms a large globular head without any periodic structure; 
the solid line down the main diagonal represents the ex- 
pected identity of the sequence with itself. The symmetrical 
parallel lines along the carboxyl-terminal half of the molecule
correspond to the 28-residue repeat responsible for the
α-helical coiled-coil structure of the rod segment.

DISCUSSION 

In searching a data base, one is attempting to measure 
relatedness; in aligning two homologous sequences, one is
trying to choose the most likely set of mutations since their
divergence from a common ancestral sequence. Thus any 
tool for the analysis of sequence similarities must contain 
within it an implicit model of molecular evolution. An 
algorithm that guarantees the optimality of its alignments 
based on a set of scoring rules must be judged on how well 
these rules fit our current understanding of the process of 
molecular evolution. Algorithms that sacrifice realism to 
achieve greater efficiency, regardless of their mathematical 
rigor, require careful empirical evaluation. 

Even though the tools we have developed use rigorous 
algorithms at each step and incorporate a realistic model of 
evolution, their hierarchical nature makes them heuristic. The
original FASTP program has had the benefit of extensive use
and evaluation by a wide variety of scientists. The FASTA
program exploits refinements of the previous approach that
result in a significant improvement in sensitivity. The LFASTA
local similarity analysis program is also a logical extension
of the FASTP approach.

Because of the trade-offs between sensitivity and selectivity
in data base searches, the results of any search, and
particularly those that result in alignment scores that are not
clearly separated from the distribution of all library sequence
scores, must be carefully evaluated (11). The Monte Carlo
analysis of statistical significance provided by a program
such as RDF2 can often be critical in evaluating a borderline
similarity. Previously we suggested ranges of z values [(observed
score - mean of shuffled scores)/standard deviation
of shuffled scores] corresponding to approximate significance
levels. However, the z values determined in a Monte
Carlo analysis become less useful as the distribution of
shuffled scores diverges from a normal distribution, as is
found with FASTA. Therefore, we now focus on the highest
scores of the shuffled sequences. For example, if in 50 
shuffled comparisons, several random scores are as high or 
higher than the observed score, then the observed similarity 
is not a particularly unlikely event. One can have more 
confidence if in 200 shuffled comparisons, no random score 
approaches the observed score. In general, our experience 
has led us to be conservative in evaluating an observed 
similarity in an unlikely biological context. 
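
Both ways of reading a Monte Carlo run, the z value and the count of shuffled scores that reach the observed score, can be computed from the same list of shuffled scores. The Python sketch below is illustrative only: the function name and return format are hypothetical, the sample standard deviation is an assumption, and the numbers in the usage note are toy values, not results from the paper.

```python
import statistics

def summarize_shuffles(observed, shuffled_scores):
    """Summarize a Monte Carlo shuffle run.

    Returns (z, n_as_high, highest):
      z          (observed - mean of shuffled scores) / standard deviation,
                 or None when the shuffled scores do not vary;
      n_as_high  number of shuffled scores as high as or higher than observed;
      highest    the highest shuffled score.
    Assumption: sample standard deviation, at least two shuffled scores.
    """
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    z = (observed - mean) / sd if sd > 0 else None
    n_as_high = sum(1 for s in shuffled_scores if s >= observed)
    return z, n_as_high, max(shuffled_scores)
```

For example, `summarize_shuffles(40, [10, 20, 30])` gives a z value of 2.0 with no shuffled score reaching the observed one. When the shuffled scores are far from normally distributed, the count and the highest shuffled score are the more trustworthy part of the summary.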

These programs provide a group of sequence analysis
tools that use a consistent measure for scoring similarity and
constructing alignments. FASTA, RDF2, and LFASTA all
use the same scoring matrices and similar alignment algorithms,
so that potentially related library sequences discovered
after the search of a sequence data base can be
evaluated further from a variety of perspectives. In addition,
LFASTA can also show alternative alignments between
sequences with periodic structures or duplications.

Fig. 3. Repeated structure in the
myosin heavy chain. LFASTA was used
to compare the Caenorhabditis elegans
myosin heavy chain protein sequence
(NBRF code MWKW) with itself using
the PAM250 scoring matrix. The solid,
dashed, and dotted lines denote decreasing
similarity scores. The solid lines had
initial region scores greater than 80 and
optimized local scores greater than 150;
the longer dashed lines had initial region
and optimized local scores greater than
65 and 120, respectively, and the shorter
dashed lines had initial region and optimized
local scores greater than 50 and
100, respectively. Homologous regions
with lower scores are plotted with dots.